This notebook demonstrates how to use the Corpus
objects to train and view a term frequency (TF) or "word count" model, a term frequency-inverse document frequency (tf-idf) model, and a Latent Semantic Analysis (LSA) model, and how to view topics in the LDA models trained through the topic explorer.
To run the notebook, use the menu Cell -> Run All,
or use the play button to run one cell at a time.
In [ ]:
# First we load the vsm module and import your corpus
# (the corpus module defines the Corpus object `c` and the
# `context_type` used below)
from vsm import *
from corpus import *
The term frequency model is the most primitive of all vector space models. For each document, we simply count the number of occurrences of each word.
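As a toy illustration of what this counting amounts to (not the vsm library's internal implementation), a term frequency table can be sketched with the standard library:
In [ ]:
# Illustrative sketch only: counting word occurrences in one document,
# independent of how the vsm library represents the counts.
from collections import Counter

doc = "the moral law and the moral agent".split()
tf_counts = Counter(doc)       # maps each word to its occurrence count
tf_counts.most_common(3)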
In [ ]:
# initialize and train the term frequency model
tf = TF(c, context_type)
tf.train()
# create a viewer for inspecting the trained model
tf_v = TfViewer(c, tf)
In [ ]:
# print the most frequent terms in the corpus (collection frequencies)
tf_v.coll_freqs()
In [ ]:
# initialize and train the tf-idf model from the trained TF model
tfidf = TfIdf.from_tf(tf)
tfidf.train()
tfidf_v = TfIdfViewer(c, tfidf)
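Tf-idf down-weights terms that occur in many documents. As a minimal sketch of the standard weighting scheme (the vsm implementation may differ in smoothing and normalization details), each raw count is scaled by the log of the inverse document frequency:
In [ ]:
# Minimal tf-idf sketch on a toy count matrix; the exact weighting
# used by the vsm library may differ.
import numpy as np

counts = np.array([[3, 0, 1],   # rows: documents, columns: terms
                   [1, 2, 0],
                   [0, 1, 4]])
df = (counts > 0).sum(axis=0)          # document frequency of each term
idf = np.log(counts.shape[0] / df)     # inverse document frequency
counts * idf                           # reweighted term-document matrix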
In [ ]:
# initialize and train the LSA model, keeping 50 latent dimensions
lsa = Lsa.from_tf(tf)
lsa.train(k_factors=50)
lsa_v = LsaViewer(c, lsa)
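Conceptually, LSA is a truncated singular value decomposition of the term-document matrix, with k_factors giving the number of singular vectors retained. A toy numpy sketch of the idea (A here is a random stand-in, not the matrix the library builds from the corpus):
In [ ]:
# Rank-k approximation via truncated SVD, the core idea behind LSA.
import numpy as np

A = np.random.rand(100, 30)    # toy term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5                          # analogous to k_factors above
A_k = (U[:, :k] * s[:k]).dot(Vt[:k, :])   # best rank-k approximation of A
A_k.shape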
The LDA model viewers are automatically imported into the dictionary lda_v, indexed as lda_v[k],
where k is the number of topics in the model.
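Since lda_v is a dictionary, its keys show which topic counts are available in this session:
In [ ]:
# list the available numbers of topics (depends on which models
# were trained through the topic explorer)
sorted(lda_v.keys())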
In [ ]:
# display the topics of the 20-topic LDA model
lda_v[20].topics()
In [ ]:
# keep a short alias for the 20-topic viewer
v = lda_v[20]
In [ ]:
# compare the documents most associated with a word across the four models
import numpy as np

words = ['moral']
# distances from the word to each LDA topic
b = np.array(v.dist_word_top(words, show_topics=False))
# rank documents against the word under each model
sim_docs = [tf_v.dist_word_doc(words),
            tfidf_v.dist_word_doc(words),
            lsa_v.dist_word_doc(words),
            v.dist_top_doc(b['i'], weights=np.ones_like(b['value']) - b['value'])]
# show the ten closest documents under each model
[docs[:10]['doc'] for docs in sim_docs]
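Each entry in the resulting list holds the ten documents most strongly associated with 'moral' under one of the four models; comparing the lists side by side shows how the choice of weighting scheme changes which documents rank highest.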